This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday October 31. I will accept late submissions with a penalty until the meetup after that when we review some projects.
# Import required R libraries
library(fpp3)
library(tidyverse)
library(readxl)
library(seasonal)
library(stringr)
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable Cash is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.
# Read in data
atm_data_raw <- read_excel("data/ATM624Data.xlsx")
# Properly convert the DATE column to match true input
# https://stackoverflow.com/questions/43230470/how-to-convert-excel-date-format-to-proper-date-in-r
atm_data_raw <- atm_data_raw %>% mutate(DATE = as.Date(DATE, origin = "1899-12-30"))
# Initial output to see data
#head(atm_data_raw)
# Output summary for high level assessment
summary(atm_data_raw)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
# Check dimensions to understand breadth of data
dim(atm_data_raw)
## [1] 1474 3
# 1474 3
#Cash in hundreds
Dates of dataset start at 2009-05-01 and end with 2010-05-14. This indicates 379 dates, which is 14 more than a single year of 365 days.
Dimensions of the initial dataset indicate 1474 observations, which again indicates more than a single year’s worth of data. The 3 columns are DATE, ATM, and Cash. The DATE indicates the specific data, the ATM indicates which of the 4 ATMs, and Cash represents the total amount withdrawn for the given date and ATM.
Data observations: ATM2: 2 Cash NA values ATM4: Many Cash values with a decimal, but not all, something weird there, also ATM4 appears to have 1 really crazy outlier
Total by day and see if there is an overall relationship to be understood
# Define as tsibble with DATE as index
atm_data_ts <- atm_data_raw %>%
as_tsibble(index = DATE, key = ATM)
# Output tsibble to confirm proper format and definition
atm_data_ts %>%
filter(DATE > "2010-04-30")
## # A tsibble: 14 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-05-01 <NA> NA
## 2 2010-05-02 <NA> NA
## 3 2010-05-03 <NA> NA
## 4 2010-05-04 <NA> NA
## 5 2010-05-05 <NA> NA
## 6 2010-05-06 <NA> NA
## 7 2010-05-07 <NA> NA
## 8 2010-05-08 <NA> NA
## 9 2010-05-09 <NA> NA
## 10 2010-05-10 <NA> NA
## 11 2010-05-11 <NA> NA
## 12 2010-05-12 <NA> NA
## 13 2010-05-13 <NA> NA
## 14 2010-05-14 <NA> NA
A closer look at the data shows 14 observations starting with date 2010-05-01 and ending on 2010-05-14 do not indicate an ATM nor a Cash amount, thus these 14 observations will be ignored in the upcoming evaluation. The removal of thse 14 observations also makes for a cleaner dataset, as now the total observations is 1460, which is exacting 365 for each of the 4 ATMs.
atm_data_ts %>%
autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM Withdrawls")
An initial plot of the time series shows the ATM4 data falls in a range greater than the other three ATMs. With the extreme outlier in ATM4, the plot does not provide much value for comparison across the four ATMS.
# Filter to ATM1 data
atm1_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM1')
# Output summary
summary(atm1_data_ts)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 73.00
## Median :2009-10-30 Mode :character Median : 91.00
## Mean :2009-10-30 Mean : 83.89
## 3rd Qu.:2010-01-29 3rd Qu.:108.00
## Max. :2010-04-30 Max. :180.00
## NA's :3
The summary of data for ATM1 confirms the 365 days (single year) of observations and also indicates 3 missing Cash values.
atm1_data_ts$DATE[is.na(atm1_data_ts$Cash)]
## [1] "2009-06-13" "2009-06-16" "2009-06-22"
The 3 dats with missing Cash values are displayed above. Not sure the significance of the missing data occurring in a small window of time, so will consider imputation.
atm1_data_ts %>%
autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM1")
Initial plot of the ATM1 data appears to show a weekly seasonal component along with a dip between October 2009 and January 2010. There does not appear to be a clear increasing or decreasing trend.
Given the low volume of missing values, three, I will simply impute the data with the median value of the ATM1 dataset.
# Calculate median value for ATM1
median <- median(atm1_data_ts$Cash, na.rm=TRUE)
# Set NAs to median
atm1_data_ts$Cash[is.na(atm1_data_ts$Cash)] <- median
summary(atm1_data_ts)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 73.00
## Median :2009-10-30 Mode :character Median : 91.00
## Mean :2009-10-30 Mean : 83.95
## 3rd Qu.:2010-01-29 3rd Qu.:108.00
## Max. :2010-04-30 Max. :180.00
Summary output confirms no missing data for ATM1 Cash amounts.
Simply updating the ATM1 data tsibble is defined properly.
atm1_data_ts <- atm1_data_ts %>%
mutate(DATE = as_date(DATE)) %>%
update_tsibble(index = DATE)
#atm1_data_ts
To better understand the trend and seasonal nature of the ATM1 dataset, decomposition is performed.
# From: https://stats.stackexchange.com/questions/494013/control-the-period-for-daily-time-series-in-tsibbles
atm1_dcmp <- atm1_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm1_dcmp %>% components() %>% autoplot()
STL decomposition shows the remainder plot has the most impact on the data based on the gray bar on the left. The seasonal plot shows a clear seasonal pattern, but while the seasonal pattern remains the same, the remainder plot shows greater variance toward the end of plot, beginning in February 2010. The trend line confirms no real trend present in the dataset.
components(atm1_dcmp) %>%
as_tsibble() %>%
filter_index("2009-05-01" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")
The above seasonally adjusted trendline confirms that despite a few outliers through December 2009, the seasonal component does well to address the weekly nature of the data, but the plot above indicates starting in February 2010, the seasonal component does not properly define the dataset. Based on the above plot, a new weekly pattern appears to emerge in February 2010.
components(atm1_dcmp) %>%
as_tsibble() %>%
filter_index("2010-02-16" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")
Above plot shows that the seasonally adjusted data does not account for the weekly seasonal nature for the last three months. Clearly a shift in the weekly seasonal pattern has occurred.
atm1_data_ts %>% ACF() %>% autoplot()
The ACF of the ATM1 data shows a clear correlation at lags 7, 14, and 21, which is expected for seasonal data following a weekly pattern.
atm1_data_ts %>% PACF() %>% autoplot()
The PACF plot re-confirms the correlations at lag 7 and 14. Interesting, correlation also appears at lags 2 and 5.
atm1_data_ts %>% gg_season(Cash, period = "week") +
labs(y="$USD (in hundreds)", title="ATM Withdrawals")
To better understand the weekly seasonal pattern, the gg_season plot above does show some sort of shift on Tuesday and Thursday.
atm1_data_ts %>%
gg_subseries(period = "week") +
labs(
y = "$USD (in hundreds)",
title = "ATM Withdrawals"
)
The `gg_subseries plot confirms the shift in February occurs on Tuesday, Wednesday, and Thursday. The other days of the week indicate some variance and outliers, but nothing as dramatic as Tuesday through Thursday.
Before forecasting the data for May 2010, I want to train and test the different model functions for proper evaluation. The models are trained on 292 days, or approximately 80% of the year represented.
# Create training set (assume 1 year of data)
# 292 days for training, approx. 80% of data
train_atm1 <- atm1_data_ts %>%
filter_index("2009-05-01" ~ "2010-02-17")
#train_atm1
# Fit the models
fit_atm1 <- train_atm1 %>%
model(
Naive = NAIVE(Cash),
`Seasonal naive` = SNAIVE(Cash),
`Random walk` = RW(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
fit_atm1 %>% accuracy()
## # A tibble: 7 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 Naive Training 0.284 49.9 37.4 -85.0 120. 2.07 1.77 -0.356
## 2 ATM1 Seasonal naive Training 0.203 28.1 18.0 -83.3 103. 1 1 0.0957
## 3 ATM1 Random walk Training 0.284 49.9 37.4 -85.0 120. 2.07 1.77 -0.356
## 4 ATM1 Arima Training 0.179 21.9 14.3 -66.5 82.7 0.790 0.780 0.00341
## 5 ATM1 ETS_Add Training 0.175 21.4 13.7 -73.3 89.7 0.761 0.763 0.0940
## 6 ATM1 ETS_Mult Training 1.98 21.6 13.7 -75.8 90.6 0.760 0.769 0.0956
## 7 ATM1 ETS_Damp Training 1.46 21.3 13.3 -76.4 90.4 0.737 0.759 0.0756
Using the different model functions from the Hyndman textbook, the best performing model based on RMSE is the ETS model with dampening. Overall, the three ETS models and the ARIMA model all perform well in comparison. The seasonal naive model also performe well with an RMSE slightly higher than the ARIMA model and the three ETS models.
# Generate forecasts for 72 days
fc_atm1 <- fit_atm1 %>% forecast(h = "72 days")
fc_atm1 %>% accuracy(atm1_data_ts)
## # A tibble: 7 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM1 Test -2.20 50.8 37.8 -466. 502. 2.09 1.81 -0.0565
## 2 ETS_Add ATM1 Test -3.25 52.4 38.0 -479. 514. 2.11 1.86 0.00348
## 3 ETS_Damp ATM1 Test -10.4 54.2 40.0 -517. 547. 2.22 1.93 -0.0246
## 4 ETS_Mult ATM1 Test -7.03 52.9 39.2 -494. 527. 2.17 1.88 -0.00808
## 5 Naive ATM1 Test -98.2 104. 98.2 -893. 893. 5.44 3.71 -0.0585
## 6 Random walk ATM1 Test -98.2 104. 98.2 -893. 893. 5.44 3.71 -0.0585
## 7 Seasonal naive ATM1 Test -5.21 64.1 53.7 -460. 509. 2.97 2.28 0.00878
The accuracy of the models on the test dataset shows ARIMA performing the best with an RMSE of 50.77. The three ETS models do also perform well.
# Plot forecasts against actual values
fc_atm1 %>%
autoplot(filter_index(train_atm1, "2010-02-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(atm1_data_ts, "2010-02-10" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
So plotting the seven forecasts along with the original dataset shows the seasonal pattern of the forecasts does not match the seasonal pattern of the actual dataset. Based on the observations in the decomposition section, the weekly pattern has changed, and thus training the model on the first 80% of the data actually misses the new pattern found in the data. Further investigation is needed.
fit_atm1_arima <- train_atm1 %>%
model(ARIMA(Cash))
# Check the residuals of Seasonal Naive
fit_atm1_arima %>% gg_tsresiduals()
To further confirm the assumption of a bad model, I’ve taken the ARIMA model, the best performing based on RMSE, and displayed the residuals above. The ACF plot of the residuals clearly indicates a correlation exists at lags 8 and 20. This indicates the model isn’t quite capturing the correlation of the test data properly.
Given the decomposition and the model evaluations indicate a change in seasonal pattern sometime in February 2010, I will attempt to train and test on just the data beginning on 2010-02-16. Yes, this certainly reduces the amount of data used to build the model, but I believe may better capture the change in the weekly pattern and thus provide better forecasts for the month of May 2010.
# ATM1 train on 2010-02-16 to 2010-04-30 (end)
# 15 + 31 + 30 = 76 days, 80% is 61 days
new_seasonal_atm1 <- atm1_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-30")
train_atm1 <- atm1_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-14")
# Fit the models
fit_atm1 <- train_atm1 %>%
model(
`Seasonal naive` = SNAIVE(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
report(fit_atm1)
## # A tibble: 5 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM1 Seaso… 575. NA NA NA NA <NULL> <NULL> NA NA
## 2 ATM1 Arima 311. -219. 447. 448. 455. <cpl [9… <cpl [0… NA NA
## 3 ATM1 ETS_A… 327. -281. 582. 586. 602. <NULL> <NULL> 276. 330.
## 4 ATM1 ETS_M… 0.155 -308. 636. 641. 657. <NULL> <NULL> 1046. 1437.
## 5 ATM1 ETS_D… 0.405 -329. 683. 692. 710. <NULL> <NULL> 27524. 41523.
## # … with 1 more variable: MAE <dbl>
Now using only the 5 top performing models due to the seasonal nature of the data, the ARIMA model performs the best of the 4 models by evaluating the AICc of 447.
fit_atm1 %>% accuracy()
## # A tibble: 5 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 Seasonal naive Trai… -5.20 24.3 15.9 -214. 230. 1 1 0.444
## 2 ATM1 Arima Trai… 0.138 16.0 11.9 -97.6 143. 0.753 0.660 -0.127
## 3 ATM1 ETS_Add Trai… -1.19 16.6 11.7 -56.5 68.7 0.739 0.684 0.0722
## 4 ATM1 ETS_Mult Trai… -2.40 32.3 20.8 -125. 145. 1.31 1.33 0.480
## 5 ATM1 ETS_Damp Trai… -10.4 166. 61.7 -13.8 68.2 3.89 6.82 0.412
The ARIMA model performs the best on the training set with an RMSE of 16.05, while the ETS Additive performs almost as well as the ARIMA model. The Seasonal Naive model performs third best and the ETS with dampening is clearly not doing well compared to the other four models.
# Generate forecasts for 16 days (two weeks), so we'll see if it picks up the seasonal nature
fc_atm1 <- fit_atm1 %>% forecast(h = "16 days")
fc_atm1 %>% accuracy(new_seasonal_atm1)
## # A tibble: 5 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM1 Test 7.71 12.8 9.50 40.9 43.2 0.599 0.527 0.0154
## 2 ETS_Add ATM1 Test 1.06 11.8 8.88 5.50 15.3 0.560 0.486 -0.337
## 3 ETS_Damp ATM1 Test 46.6 54.0 46.6 57.3 58.4 2.94 2.22 0.0345
## 4 ETS_Mult ATM1 Test 9.17 15.1 11.0 -2.95 26.2 0.694 0.621 -0.161
## 5 Seasonal naive ATM1 Test 0.125 9.64 7.12 1.26 9.83 0.449 0.397 -0.00791
For the accuracy of the five models on the test set, the Seasonal Naive model actually performs the best with an RMSE of 9.64, while ARIMA and ETS Additive, also perform well.
# Plot forecasts against actual values
fc_atm1 %>%
autoplot(filter_index(train_atm1, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The plot of the test forecasts confirms the models are correctly capturing the seasonal nature of the data near the end of the dataset.
fit_atm1_arima<- train_atm1 %>%
model(ARIMA(Cash))
# Check the residuals of ARIMA
fit_atm1_arima %>% gg_tsresiduals()
As above, I’ve displayed the residuals of the ARIMA model to assess the appropriateness of the model. The plot of Innovation residuals appears to follow white noise after the first 10 observations or so. The ACF plot indicates no correlation outside of the confidence intervals.
# Consider Box-Cox
lambda <- new_seasonal_atm1 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
new_seasonal_atm1 %>%
autoplot(box_cox(Cash, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed ATM1 withdrawals with $\\lambda$ = ",
round(lambda,2))))
Applying Box-Cox to see if I can squeeze out a little better performance on the test data.
# Forecast for May 2010
fit_atm1_bc <- train_atm1 %>%
model(
`Seasonal naive` = SNAIVE(box_cox(Cash, lambda)),
ARIMA(box_cox(Cash, lambda)),
ETS_Add = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A"))
)
report(fit_atm1_bc)
## # A tibble: 3 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM1 Seasonal… 323. NA NA NA NA <NULL> <NULL> NA NA
## 2 ATM1 ARIMA(bo… 175. -206. 417. 418. 423. <cpl [1… <cpl [7… NA NA
## 3 ATM1 ETS_Add 182. -264. 548. 552. 568. <NULL> <NULL> 154. 185.
## # … with 1 more variable: MAE <dbl>
The AICc of the ARIMA model with Box-Cox transformation has improved from 447.73 to 417.98. A promising sign.
fit_atm1_bc %>% accuracy()
## # A tibble: 3 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 Seasonal naive Train… -5.20 24.3 15.9 -214. 230. 1 1 0.444
## 2 ATM1 ARIMA(box_cox… Train… -2.56 16.1 12.3 -105. 127. 0.773 0.661 -0.313
## 3 ATM1 ETS_Add Train… -1.11 16.6 12.0 -66.2 77.5 0.755 0.682 0.0836
The accuracy of the models with Box-Cox show little to no improvement over the models without transformations on the training set. The RMSE of the ARIMA model with the transformation actually increases from 16.04555 to 16.06325 as noted above.
# Generate forecasts for the next 16 days
fc_atm1_bc <- fit_atm1_bc %>% forecast(h = "16 days")
fc_atm1_bc %>% accuracy(new_seasonal_atm1)
## # A tibble: 3 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(box_cox(… ATM1 Test -6.25 11.5 9.35 -32.6 36.2 0.589 0.474 -0.360
## 2 ETS_Add ATM1 Test -0.247 11.8 9.24 -24.8 35.4 0.582 0.487 -0.398
## 3 Seasonal naive ATM1 Test -1.05 9.87 7.88 -21.8 29.4 0.497 0.406 -0.0900
The RMSE of the ARIMA model with Box-Cox transformation on the test data does improve the RMSE compared to the ARIMA model without transformation from 12.812248 to 11.53609.
# Plot forecasts against actual values
fc_atm1_bc %>%
autoplot(filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The above plot shows the forecasts of the models with transformation along with the actual data. Similar to the plot from the models with transformations, the forecasts do follow the seasonal pattern of the data.
Given the assignment is to forecast the data for May 2010 from the provided data, I believe the models generated from the final two and a half months is the appropriate approach. Clearly, a shift in the weekly pattern occurs in February 2010 and without any additional context or history, I assume that shift will continue through May 2010. As the forecast is only 1 month or approximately four more weeks from the end of the provided data, then I choose to believe that new pattern will persist for the next four weeks through May 2010.
atm2_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM2')
atm2_data_ts %>% autoplot(Cash)
# Calculate median value for ATM2
median <- median(atm2_data_ts$Cash, na.rm=TRUE)
# Set NAs to median
atm2_data_ts$Cash[is.na(atm2_data_ts$Cash)] <- median
#atm2_data_ts
atm2_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))
) %>%
components() %>% autoplot()
Same thing as ATM1, the seasonal nature of the data changes in the final few months. I’m gonna separate the forecast into just the final section where the weekly seasonal nature shifts.
atm2_data_ts %>% ACF() %>% autoplot()
atm2_data_ts %>% PACF() %>% autoplot()
atm2_data_ts %>% gg_season(Cash, period = "week") +
theme(legend.position = "none") +
labs(y="$ (in hundreds)", title="ATM Withdrawals")
atm2_data_ts %>%
gg_subseries(period = "week") +
labs(
y = "$ (in hundreds)",
title = "ATM Withdrawals"
)
atm2_dcmp <- atm2_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm2_dcmp %>% components() %>% autoplot()
components(atm2_dcmp) %>%
as_tsibble() %>%
filter_index("2009-05-01" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")
components(atm2_dcmp) %>%
as_tsibble() %>%
filter_index("2010-02-16" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")
Now, for ATM3.
atm3_data_ts_all <- atm_data_ts %>%
filter(ATM == 'ATM3')
summary(atm3_data_ts_all)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 0.0000
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.0000
## Median :2009-10-30 Mode :character Median : 0.0000
## Mean :2009-10-30 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :96.0000
From summary output, the date range (2009-05-01 to 2010-04-03) matches expectations given the data from ATM1 and ATM2. The Cash column does not contain any missing data but does indicate curious aspects. The range is 0 to 96 which might be odd to have 0, but the median is also 0, which certainly indicates a curiosity about the data. The mean is also quite low, less than 1.
atm3_data_ts_all %>% autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM3 Withdrawals")
A simply plot of the data clearly shows unexpected information. Only a few days at the end of the dataset contain anything above zero.
atm3_data_ts <- atm3_data_ts_all %>%
filter(ATM == 'ATM3', Cash > 0)
atm3_data_ts
## # A tsibble: 3 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-04-28 ATM3 96
## 2 2010-04-29 ATM3 82
## 3 2010-04-30 ATM3 85
By removing every date with a value equal to 0, only three dates remain - the final three dates of the dataset.
summary(atm3_data_ts)
## DATE ATM Cash
## Min. :2010-04-28 Length:3 Min. :82.00
## 1st Qu.:2010-04-28 Class :character 1st Qu.:83.50
## Median :2010-04-29 Mode :character Median :85.00
## Mean :2010-04-29 Mean :87.67
## 3rd Qu.:2010-04-29 3rd Qu.:90.50
## Max. :2010-04-30 Max. :96.00
For those three observations, the range is 82 to 96 with a mean of 87.67.
atm3_data_ts %>% autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM3 Withdrawals")
The plot confirms that only three dates contain a Cash amount greater than zero.
A decision much be made in how to approach this dataset. Of 365 days, 362 contain a Cash amount of zero and the final three dates contain a Cash amount meeting expectations. Without history or context of ATM3, I will make the assumption that ATM3 was broken or in some way not usable for everyday with a Cash amount of zero. My assumption would follow that starting on 2010-04-28 ATM3 was fixed or made active, and thus can be assumed working and active for the month of May 2010.
With so little data to build a model, I will choose the MEAN function based on the three dates containing Cash amounts above zero.
fit_atm3 <- atm3_data_ts %>% model(MEAN(Cash))
report(fit_atm3)
## Series: Cash
## Model: MEAN
##
## Mean: 87.6667
## sigma^2: 54.3333
The model returns an expected value of 87.67 as the mean, which was indicated in the summary of the data.
# Forecast for 31 days of May 2010
fc_atm3 <- fit_atm3 %>% forecast(h = "31 days")
fc_atm3 %>%
autoplot(filter_index(atm3_data_ts_all, "2009-05-01" ~ "2010-05-31"), level = NULL) +
autolayer(
filter_index(atm1_data_ts, "2010-04-28" ~ "2010-05-31"),
colour = "black"
) +
labs(y="Cash (in hundreds $USD)", title="ATM3 Withdrawals with Forecast") +
guides(colour = guide_legend(title = "Forecast"))
Not a very interesting plot of the forecasted data in relation to the ATM3 dataset, but given the minimal amount of data, the mean forecast is a constant line for the month of May 2010.
There’s only 3 values with data above zero, so I’m assuming there was absolutely no activity but the ATM3 existed, or else it would have NA.
atm4_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM4')
summary(atm4_data_ts)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.563
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 124.334
## Median :2009-10-30 Mode :character Median : 403.839
## Mean :2009-10-30 Mean : 474.043
## 3rd Qu.:2010-01-29 3rd Qu.: 704.507
## Max. :2010-04-30 Max. :10919.762
#atm4_data_ts
As expected, the summary for the ATM4 dataset falls on the dates 2009-05-01 through 2010-04-30. The Cash column indicats no missing values. The range of the Cash column is 1.563 to 10919.762 with a mean of 474.043. The low value is certainly low, but thankfully not zero, as addressed in ATM3. The high value is quite high compared to the mean and the third quartile. Perhaps an outlier or a few exists in the ATM4 data.
atm4_data_ts %>% autoplot(Cash) +
labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")
The plot of the ATM4 data shows a clear outlier in February 2010.
atm4_data_ts %>%
filter(Cash > 3000)
## # A tsibble: 1 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 10920.
The day of the outlier turns out to be 2010-02-09.
atm4_dcmp <- atm4_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm4_dcmp %>% components() %>% autoplot()
Given the weekly seasonal pattern worked for ATM1 and ATM2, let’s try decomposition on ATM4. Given the plot above, the outlier makes the plots near impossible to evaluate.
Let’s the remove the outlier by imputing the median value in its place.
# Calculate median value for ATM
# Straightforward approach to impute data
median <- median(atm4_data_ts$Cash, na.rm=TRUE)
# Get rid of outlier from ATM4
atm4_data_ts$Cash[atm4_data_ts$Cash > 3000] <- median
atm4_data_ts %>% autoplot(Cash) +
labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")
By removing the outlier, the data does appear more approachable. Overall, I don’t detect a clear trend, perhaps a slight decrease. Given the data represents a single year, I don’t expect to find a cyclic component, but perhaps a weekly seasonal component exists. Let’s try the decomposition again, now excluding the outlier.
atm4_dcmp <- atm4_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm4_dcmp %>% components() %>% autoplot()
The remainder component does play a larger factor than the seasonal component. As identified in ATM1 and ATM2, the trend component has minimal impact. Let’s try the seasonally adjusted plot.
components(atm4_dcmp) %>%
as_tsibble() %>%
filter_index("2009-05-01" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")
The seasonally-adjusted trendline plot above does not convey any clear weekly pattern.
atm4_data_ts %>% ACF() %>% autoplot()
The ACF plot indicates correlation at lags 7, 14, and 21 as expected. Also, correlation appears at lag 10.
atm4_data_ts %>% PACF() %>% autoplot()
The PACF shows correlation at lags 7, 10, and 19.
atm4_data_ts %>% gg_season(Cash, period = "week") +
labs(y="$ (in hundreds)", title="ATM4 Withdrawals")
I’ll admit the above plot of gg_season does not provide much value except perhaps for some nice colors.
atm4_data_ts %>%
gg_subseries(period = "week") +
labs(
y = "$ (in hundreds)",
title = "ATM Withdrawals"
)
The `gg_subseries`` plot above does indicate a change in pattern on Tuesday, Wednesday, and Thursday, similar to that found in ATM1 and ATM2, but not quite as drastic and clear.
atm4_data_post0209 <- atm4_data_ts %>%
filter_index("2010-02-10" ~ "2010-04-30")
atm4_data_post0209 %>%
autoplot(Cash) +
labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")
atm4_dcmp <- atm4_data_post0209 %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm4_dcmp %>% components() %>% autoplot()
components(atm4_dcmp) %>%
as_tsibble() %>%
filter_index("2010-02-10" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")
Still too irregular. Try ARIMA and ETS and see what sort of results I can get.
Given I’m struggling to find a viable weekly seasonal pattern through visual inspection, I will apply the different seasonal models through the train and test approach.
# Create training set (1 year of data)
# 292 days for training, approx. 80% of data
train_atm4 <- atm4_data_ts %>%
filter_index("2009-05-01" ~ "2010-02-17")
# Fit the models
fit_atm4 <- train_atm4 %>%
model(
`Seasonal naive` = SNAIVE(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
report(fit_atm4)
## # A tibble: 5 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl>
## 1 ATM4 Seasonal … 2.09e+5 NA NA NA NA <NULL> <NULL> NA
## 2 ATM4 Arima 1.25e+5 -2134. 4274. 4274. 4285. <cpl [7… <cpl [0… NA
## 3 ATM4 ETS_Add 1.05e+5 -2521. 5063. 5063. 5099. <NULL> <NULL> 101739.
## 4 ATM4 ETS_Mult 5.14e-1 -2469. 4958. 4959. 4995. <NULL> <NULL> 103349.
## 5 ATM4 ETS_Damp 4.88e-1 -2478. 4981. 4982. 5029. <NULL> <NULL> 106682.
## # … with 2 more variables: AMSE <dbl>, MAE <dbl>
fit_atm4 %>% accuracy()
## # A tibble: 5 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 Seasonal naive Trai… 2.11 456. 347. -366. 422. 1 1 0.0620
## 2 ATM4 Arima Trai… 0.0482 352. 295. -501. 535. 0.850 0.772 0.0819
## 3 ATM4 ETS_Add Trai… -9.31 319. 249. -421. 450. 0.717 0.699 0.110
## 4 ATM4 ETS_Mult Trai… 11.5 321. 251. -412. 444. 0.723 0.705 0.108
## 5 ATM4 ETS_Damp Trai… -18.3 327. 257. -434. 463. 0.741 0.716 0.0931
# Generate forecasts for 72 days
fc_atm4 <- fit_atm4 %>% forecast(h = "72 days")
fc_atm4 %>% accuracy(atm4_data_ts)
## # A tibble: 5 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM4 Test -19.2 319. 266. -777. 808. 0.768 0.699 -0.0357
## 2 ETS_Add ATM4 Test -20.7 386. 319. -918. 960. 0.919 0.847 0.0325
## 3 ETS_Damp ATM4 Test -34.1 400. 330. -1017. 1059. 0.951 0.878 0.00763
## 4 ETS_Mult ATM4 Test -5.37 387. 321. -913. 958. 0.925 0.849 0.00836
## 5 Seasonal naive ATM4 Test -101. 520. 417. -1484. 1530. 1.20 1.14 -0.0989
# Plot forecasts against actual values
fc_atm4 %>%
autoplot(filter_index(train_atm4, "2010-02-15" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(atm4_data_ts, "2010-02-15" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM4 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The above plot indicates the models are not following the weekly seasonal pattern. Let me try the truncated version similar to ATM1 and ATM2.
# ATM4 train on 2010-02-16 to 2010-04-30 (end)
# 15 + 31 + 30 = 76 days, 80% is 61 days
new_seasonal_atm4 <- atm4_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-30")
train_atm4 <- atm4_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-14")
# Fit the models
fit_atm4 <- train_atm4 %>%
model(
`Seasonal naive` = SNAIVE(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
report(fit_atm4)
## # A tibble: 5 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl>
## 1 ATM4 Seasonal … 2.09e+5 NA NA NA NA <NULL> <NULL> NA
## 2 ATM4 Arima 1.11e+5 -419. 841. 841. 845. <cpl [0… <cpl [0… NA
## 3 ATM4 ETS_Add 9.95e+4 -447. 913. 918. 934. <NULL> <NULL> 8.41e4
## 4 ATM4 ETS_Mult 5.76e-1 -452. 924. 928. 944. <NULL> <NULL> 4.57e5
## 5 ATM4 ETS_Damp 3.74e+0 -517. 1060. 1068. 1087. <NULL> <NULL> 4.09e6
## # … with 2 more variables: AMSE <dbl>, MAE <dbl>
fit_atm4 %>% accuracy()
## # A tibble: 5 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 Seasonal … Train… -1.05e+ 1 453. 349. -524. 578. 1 1 0.0769
## 2 ATM4 Arima Train… 1.94e-10 330. 280. -747. 780. 0.803 0.728 0.0556
## 3 ATM4 ETS_Add Train… -3.45e+ 1 290. 228. -204. 236. 0.652 0.640 0.0543
## 4 ATM4 ETS_Mult Train… -1.75e+ 2 676. 391. -301. 328. 1.12 1.49 0.512
## 5 ATM4 ETS_Damp Train… -1.51e+ 1 2022. 1372. -4.63 506. 3.93 4.46 0.803
# Generate forecasts for 16 days (two weeks), so we'll see if it picks up the seasonal nature
fc_atm4 <- fit_atm4 %>% forecast(h = "16 days")
fc_atm4 %>% accuracy(new_seasonal_atm4)
## # A tibble: 5 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM4 Test -135. 278. 229. -924. 937. 0.655 0.613 -0.308
## 2 ETS_Add ATM4 Test -147. 274. 213. -636. 645. 0.610 0.605 0.0327
## 3 ETS_Damp ATM4 Test 616. 690. 616. 670. 670. 1.77 1.52 -0.194
## 4 ETS_Mult ATM4 Test -92.1 242. 189. -618. 632. 0.542 0.534 -0.00529
## 5 Seasonal naive ATM4 Test -244. 449. 370. -1178. 1201. 1.06 0.991 0.219
# Plot forecasts against actual values
fc_atm4 %>%
autoplot(filter_index(train_atm4, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM4 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
fit_atm4_ets <- train_atm4 %>%
model(ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")))
# Check the residuals of ARIMA
fit_atm4_ets %>% gg_tsresiduals()
ETS Multiplicative appears to perform best
Box-cox
# Consider Box-Cox
lambda <- new_seasonal_atm4 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
new_seasonal_atm4 %>%
autoplot(box_cox(Cash, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed ATM4 withdrawals with $\\lambda$ = ",
round(lambda,2))))
fit_atm4_bc <- train_atm4 %>%
model(
`Seasonal naive` = SNAIVE(box_cox(Cash, lambda)),
ARIMA(box_cox(Cash, lambda)),
ETS_Add = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M")),
)
report(fit_atm4_bc)
## # A tibble: 4 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM4 Seasonal… 53.7 NA NA NA NA <NULL> <NULL> NA NA
## 2 ATM4 ARIMA(bo… 23.6 -176. 360. 361. 368. <cpl [1… <cpl [0… NA NA
## 3 ATM4 ETS_Add 25.0 -206. 432. 437. 453. <NULL> <NULL> 21.1 18.5
## 4 ATM4 ETS_Mult 0.135 -216. 451. 456. 472. <NULL> <NULL> 30.5 31.0
## # … with 1 more variable: MAE <dbl>
fit_atm4_bc %>% accuracy()
## # A tibble: 4 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 Seasonal naive Train… -10.5 453. 349. -524. 578. 1 1 0.0769
## 2 ATM4 ARIMA(box_cox(… Train… 60.6 293. 239. -218. 257. 0.684 0.646 -0.0113
## 3 ATM4 ETS_Add Train… 51.8 298. 218. -126. 158. 0.626 0.658 0.0641
## 4 ATM4 ETS_Mult Train… 11.0 370. 283. -310. 348. 0.810 0.816 0.144
# Generate forecasts for the next 16 days
fc_atm4_bc <- fit_atm4_bc %>% forecast(h = "16 days")
fc_atm4_bc %>% accuracy(new_seasonal_atm4)
## # A tibble: 4 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(box_cox(C… ATM4 Test -174. 321. 246. -1011. 1025. 0.705 0.708 0.380
## 2 ETS_Add ATM4 Test -207. 341. 275. -737. 747. 0.788 0.752 0.0637
## 3 ETS_Mult ATM4 Test -317. 459. 363. -922. 930. 1.04 1.01 0.401
## 4 Seasonal naive ATM4 Test -616. 823. 705. -2079. 2094. 2.02 1.82 0.433
# Plot forecasts against actual values
fc_atm4_bc %>%
autoplot(filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM4 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
Box-Cox was not better, so go with model without transformation
atm_data_all <- bind_rows(atm1_data_ts,
atm2_data_ts,
atm3_data_ts_all,
atm4_data_ts)
atm_data_total_by_day <- atm_data_all %>%
index_by(DATE) %>%
summarize(Cash_Total = sum(Cash))
atm_data_total_by_day
atm_data_total_by_day %>% autoplot(Cash_Total)
# Seasonally adjusted plot
atm_dcmp <- atm_data_total_by_day %>%
model(stl = STL(Cash_Total ~ trend(window=Inf) + season(period=7, window="periodic")))
atm_dcmp %>% components() %>% autoplot()
components(atm_dcmp) %>%
as_tsibble() %>%
autoplot(Cash_Total, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
# Read in data
power_data_raw <- read_excel("data/ResidentialCustomerForecastLoad-624.xlsx")
# Change column name to 'Month'
names(power_data_raw)[names(power_data_raw) == 'YYYY-MMM'] <- 'Month'
# Display first 5 rows for visual inspection
#head(power_data_raw)
# Display summary for initial assessment
summary(power_data_raw)
## CaseSequence Month KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
The summary of the residential power data indicates 192 months, which is exactly 16 years. The KWH column shows 1 missing value with a range of 770523 to 10655730. The minimum value being so distanct from the first quartile value may indicate an outlier. The CaseSequence column appears to be a integer unique identifer for each month, so will ignore moving forward.
# Display dimensions for assessment
# should be 16 years of data 1998-2013
# Yep, 192/12 = 16
dim(power_data_raw)
## [1] 192 3
The dimensions output confirms the summary output, 192 observations with three columns.
power_data_ts <- power_data_raw %>%
mutate(Month = yearmonth(Month)) %>%
mutate(KWH = KWH/1e3) %>% # In thousands
select(-c(CaseSequence)) %>%
as_tsibble(index = Month)
# Output first 5 rows of tsibble
#head(power_data_ts)
# Inital plot of data
power_data_ts %>%
autoplot(KWH)
After converting the raw data to a tsibble with the Month as the index, we do see a missing value perhaps sometime in 2008 and an outlier in early 2010.
power_data_ts$Month[is.na(power_data_ts$KWH)]
## <yearmonth[1]>
## [1] "2008 Sep"
Month with missing value is “2008 Sep”.
power_data_ts$Month[power_data_ts$KWH < 3000]
## <yearmonth[2]>
## [1] NA "2010 Jul"
Month with outlier is “2010 Jul”.
First observations, data is Monthly, so I’d expect a seasonal component. 1 month is missing KWH has 1 NA value (2008 Sep) Considering imputing with median Sep Value Outlier (with very small value in July 2010) Considering imputing with median Jul Value
Given the data is monthly and appears to follow a clear seasonal pattern, I will impute the missing value and the outlier value by calculating the median value for each month for imputation.
First, I will find the median value for all September entries.
# I want all the September values
sep_data <- power_data_ts %>%
filter(str_detect(Month, "Sep"))
sep_data <- as_tibble(sep_data)
sep_kwh_med <- sep_data %>%
summarise(Median = median(KWH, na.rm = TRUE))
sep_kwh_med
## # A tibble: 1 × 1
## Median
## <dbl>
## 1 7667.
# Median 7666.97
power_data_ts[129, 2] <- sep_kwh_med
Median value for all other September values is 7666.97.
# I want all the July values
jul_data <- power_data_ts %>%
filter(str_detect(Month, "Jul"))
jul_data <- as_tibble(jul_data)
jul_kwh_med <- jul_data %>%
summarise(Median = median(KWH, na.rm = TRUE))
jul_kwh_med
## # A tibble: 1 × 1
## Median
## <dbl>
## 1 7679.
# Median 7678.623
power_data_ts[151, 2] <- jul_kwh_med
Median value for all other July values is 7695.942.
power_data_ts %>% autoplot(KWH)
Above plot shows the residential customer power data with imputations. Clearly a seasonal pattern exists along with an increasing variance near the end of the plot. Perhaps a Box-Cox transformation could be useful. Also, an slight increasing trend does appear, at least for the second half of the plot.
power_dcmp <- power_data_ts %>%
model(stl = STL(KWH ~ trend(window=Inf) + season(period=12, window="periodic")))
power_dcmp %>% components() %>% autoplot()
The decomposition of the data indicates an increasing trend along with a seasonal pattern and remainder component that appear equal in weight.
components(power_dcmp) %>%
as_tsibble() %>%
autoplot(KWH, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "KWH (in thousands)", title = "Seasonally Adjusted Trendline")
The plot of the seasonally adjusted data indicates the seasonal pattern does persist through the full dataset with a few exceptions throughout.
power_data_ts %>% ACF() %>% autoplot()
ACF plot shows the highest correlation at lag 12, which is expected for monthly data over the course of several years.
power_data_ts %>% PACF() %>% autoplot()
PACF plot indicates autocorrelation at lags 1, 2, 5, 11, and 12, interesting result for monthly data in which I expected the PACF plot to show the higest correlation at lag 12.
power_data_ts %>% gg_season(KWH, period = "year") +
labs(y="KWH (in thousands)", title="Residential Customer Power Usage")
The gg_season plot certainly shows a yearly pattern with low usage in the spring and fall and higher usage in the summer and winter.
power_data_ts %>%
gg_subseries(period = "year") +
labs(y="KWH (in thousands)", title="Residential Customer Power Usage")
TODO: The gg_subseries plot indicates:
TODO: Too soon with this
# Seasonal differencing
power_data_ts %>%
gg_tsdisplay(difference(KWH, 12),
plot_type='partial', lag=36) +
labs(title="Seasonally differenced", y="")
# Train on 154 months, which is 12 years and 8 months (80% of the total)
# Total is 192 months (16 years)
train_power_data <- power_data_ts %>%
filter_index("1998-Jan" ~ "2010-Oct")
Calculate lambda for the Box-Cox transformation
lambda <- power_data_ts %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
train_power_data %>%
autoplot(box_cox(KWH, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed ATM1 withdrawals with $\\lambda$ = ",
round(lambda,2))))
# Fit the models
fit_power_bc <- train_power_data %>%
model(
Naive = NAIVE(box_cox(KWH, lambda)),
`Seasonal naive` = SNAIVE(box_cox(KWH, lambda)),
`Random walk` = RW(box_cox(KWH, lambda)),
Arima = ARIMA(box_cox(KWH, lambda)),
ETS_Add = ETS(box_cox(KWH, lambda) ~ error("A") + trend("A") + season("A")),
ETS_Mult = ETS(box_cox(KWH, lambda) ~ error("M") + trend("A") + season("M")),
ETS_Damp = ETS(box_cox(KWH, lambda) ~ error("M") + trend("Ad") + season("M"))
)
report(fit_power_bc)
## # A tibble: 7 × 11
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 Naive 1.21e-3 NA NA NA NA <NULL> <NULL> NA NA
## 2 Seasona… 3.58e-4 NA NA NA NA <NULL> <NULL> NA NA
## 3 Random … 1.21e-3 NA NA NA NA <NULL> <NULL> NA NA
## 4 Arima 2.10e-4 398. -785. -785. -771. <cpl [1… <cpl [1… NA NA
## 5 ETS_Add 2.22e-4 268. -503. -498. -451. <NULL> <NULL> 1.99e-4 2.05e-4
## 6 ETS_Mult 1.33e-5 269. -504. -500. -452. <NULL> <NULL> 1.97e-4 2.04e-4
## 7 ETS_Damp 1.35e-5 268. -501. -496. -446. <NULL> <NULL> 1.99e-4 2.08e-4
## # … with 1 more variable: MAE <dbl>
fit_power_bc %>% accuracy()
## # A tibble: 7 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Naive Training -6.45 1289. 1069. -2.21 17.2 1.92 1.80 0.182
## 2 Seasonal naive Training 69.0 715. 557. 0.478 8.79 1 1 0.267
## 3 Random walk Training -6.45 1289. 1069. -2.21 17.2 1.92 1.80 0.182
## 4 Arima Training 7.10 518. 397. -0.309 6.34 0.713 0.725 0.0885
## 5 ETS_Add Training 14.8 542. 422. -0.447 6.69 0.758 0.758 0.324
## 6 ETS_Mult Training 14.0 539. 414. -0.480 6.55 0.744 0.754 0.311
## 7 ETS_Damp Training 53.2 543. 414. 0.167 6.50 0.744 0.760 0.296
The best RMSE on the training data was attained by the ARIMA model followed by the three ETS models.
# Generate forecasts for 38 months, so we'll see if it picks up the seasonal nature
fc_power_bc <- fit_power_bc %>% forecast(h = "38 months")
fc_power_bc %>% accuracy(power_data_ts)
## # A tibble: 7 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima Test 621. 1034. 757. 7.25 9.42 1.36 1.45 0.229
## 2 ETS_Add Test 633. 1084. 798. 7.33 9.99 1.43 1.52 0.175
## 3 ETS_Damp Test 691. 1107. 820. 8.13 10.2 1.47 1.55 0.172
## 4 ETS_Mult Test 590. 1038. 751. 6.77 9.36 1.35 1.45 0.164
## 5 Naive Test -1348. 2580. 2155. -23.4 32.0 3.87 3.61 0.689
## 6 Random walk Test -1348. 2580. 2155. -23.4 32.0 3.87 3.61 0.689
## 7 Seasonal naive Test 303. 1018. 850. 2.77 11.0 1.53 1.42 0.419
The best RMSE is the Seasonal Naive model followed closely by the Arima model and the ETS model with Multiplicative.
# Plot forecasts against actual values
fc_power_bc %>%
autoplot(filter_index(power_data_ts, "2010-Nov" ~ "2013-Dec"), level = NULL) +
autolayer(
filter_index(power_data_ts, "2010-Nov" ~ "2013-Dec"),
colour = "black"
) +
labs(
y = "KWH (in thousands)",
title = "Power Forecast"
) +
guides(colour = guide_legend(title = "Forecast"))
Let’s forecast for the year 2014 with the three best performing models.
# Generate forecasts for 50 months, which covers the end of the provided time series and the next 12 months of 2014
fc_power_bc14 <- fit_power_bc %>% forecast(h = "50 months")
fc_power_bc14 %>%
autoplot(filter_index(power_data_ts, "2010-Nov" ~ "2014-Dec"), level = NULL) +
autolayer(
filter_index(power_data_ts, "2010-Nov" ~ "2014-Dec"),
colour = "black"
) +
labs(
y = "KWH (in thousands)",
title = "Power Forecast"
) +
guides(colour = guide_legend(title = "Forecast"))
Go with the ARIMA model.
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
library(lubridate)
# Read in data
pipe1_data_raw <- read_excel("data/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
pipe2_data_raw <- read_excel("data/Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
# From: https://stackoverflow.com/questions/53635818/convert-datetime-from-excel-to-r
pipe1_data_raw$`Date Time` <- as.POSIXct(pipe1_data_raw$`Date Time`,
origin="1899-12-30",
tz="GMT")
pipe2_data_raw$`Date Time` <- as.POSIXct(pipe2_data_raw$`Date Time`,
origin="1899-12-30",
tz="GMT")
# Initial output to see data
head(pipe1_data_raw)
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 00:24:06 23.4
## 2 2015-10-23 00:40:02 28.0
## 3 2015-10-23 00:53:51 23.1
## 4 2015-10-23 00:55:40 30.0
## 5 2015-10-23 01:19:17 6.00
## 6 2015-10-23 01:23:58 15.9
head(pipe2_data_raw)
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 18.8
## 2 2015-10-23 02:00:00 43.1
## 3 2015-10-23 03:00:00 38.0
## 4 2015-10-23 04:00:00 36.1
## 5 2015-10-23 05:00:00 31.9
## 6 2015-10-23 06:00:00 28.2
summary(pipe1_data_raw)
## Date Time WaterFlow
## Min. :2015-10-23 00:24:06 Min. : 1.067
## 1st Qu.:2015-10-25 11:21:35 1st Qu.:13.683
## Median :2015-10-27 20:07:30 Median :19.880
## Mean :2015-10-27 20:49:15 Mean :19.897
## 3rd Qu.:2015-10-30 08:24:51 3rd Qu.:26.159
## Max. :2015-11-01 23:35:43 Max. :38.913
summary(pipe2_data_raw)
## Date Time WaterFlow
## Min. :2015-10-23 01:00:00 Min. : 1.885
## 1st Qu.:2015-11-02 10:45:00 1st Qu.:28.140
## Median :2015-11-12 20:30:00 Median :39.682
## Mean :2015-11-12 20:30:00 Mean :39.556
## 3rd Qu.:2015-11-23 06:15:00 3rd Qu.:50.782
## Max. :2015-12-03 16:00:00 Max. :78.303
dim(pipe1_data_raw)
## [1] 1000 2
dim(pipe2_data_raw)
## [1] 1000 2
# Calculate the average flow per hour for pipe1
pipe1_data_raw$Date <- as.Date(pipe1_data_raw$`Date Time`)
pipe1_data_raw$Time <- hour(pipe1_data_raw$`Date Time`) + 1
pipe1_data_raw <- pipe1_data_raw %>%
group_by(Date, Time) %>%
summarise(WF_AVG = mean(WaterFlow)) %>% ungroup()
pipe1_data_raw$`Date Time` <- with(pipe1_data_raw, ymd_h(paste(Date, Time)))
pipe1_data_raw <- pipe1_data_raw %>%
select(c(`Date Time`, WF_AVG)) %>%
rename(WaterFlow = WF_AVG)
summary(pipe1_data_raw)
## Date Time WaterFlow
## Min. :2015-10-23 01:00:00 Min. : 8.923
## 1st Qu.:2015-10-25 11:45:00 1st Qu.:17.033
## Median :2015-10-27 23:30:00 Median :19.784
## Mean :2015-10-27 23:38:53 Mean :19.893
## 3rd Qu.:2015-10-30 11:15:00 3rd Qu.:22.789
## Max. :2015-11-02 00:00:00 Max. :31.730
# Define as tsibble
pipe1_data_ts <- pipe1_data_raw %>%
as_tsibble(index = `Date Time`)
pipe1_data_ts %>% autoplot()
pipe2_data_ts <- pipe2_data_raw %>%
as_tsibble(index = `Date Time`)
pipe2_data_ts %>%
filter_index("2015-10-23" ~ "2015-11-01") %>%
autoplot() +
autolayer(pipe1_data_ts)
#30#